
Conversation


stop1one commented Oct 1, 2025

Description

Related Issues

Fixes #316 #374

Summary of Changes

This PR fixes two problems for distributed training when run_test=True:

  1. Synchronization before testing

    • Added torch.distributed.barrier() when args.distributed is enabled, right after training and before testing.
    • Ensures all processes wait until rank 0 has finished saving checkpoint_best_total.pth before moving on to testing.
  2. Correct checkpoint loading in DDP

    • Changed model.load_state_dict(best_state_dict) to model_without_ddp.load_state_dict(best_state_dict).
    • Prevents state-dict key mismatch errors (the "module." prefix added by DistributedDataParallel) when loading checkpoints in distributed runs; a sketch of the resulting flow follows this list.
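For reference, a minimal sketch of the post-training flow these two changes target (variable names such as args, model_without_ddp, and the "model" key inside the checkpoint are assumptions taken from the PR description and common DETR-style training scripts, not the verbatim diff):

import os

import torch
import torch.distributed as dist

def load_best_for_testing(args, model_without_ddp, output_dir):
    # Wait until rank 0 has finished writing checkpoint_best_total.pth
    # before any rank tries to read it for testing.
    if args.distributed:
        dist.barrier()

    # Load the best weights into the unwrapped module so the state-dict keys
    # match (no "module." prefix added by DistributedDataParallel).
    ckpt_path = os.path.join(output_dir, "checkpoint_best_total.pth")
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    # Assumes the weights sit under a "model" key, as is common in
    # DETR-style checkpoints; fall back to the raw dict otherwise.
    best_state_dict = checkpoint.get("model", checkpoint)
    model_without_ddp.load_state_dict(best_state_dict)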

Type of change


  • Bug fix (non-breaking change which fixes an issue)

How has this change been tested? Please provide a test case or example of how you tested the change.

train.py:

from rfdetr import RFDETRLarge

model = RFDETRLarge()

model.train(
    dataset_dir=<DATASET_PATH>,
    epochs=30,
    batch_size=8,
    grad_accum_steps=2,
    lr=1e-4,
    output_dir=<OUTPUT_PATH>,
)

python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py
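On recent PyTorch versions the same multi-GPU run can equivalently be launched with torchrun, which replaces the older torch.distributed.launch and sets the environment variables that --use_env provided:

torchrun --nproc_per_node=4 train.py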


CLAassistant commented Oct 1, 2025

CLA assistant check
All committers have signed the CLA.


stop1one commented Nov 6, 2025

Hi @probicheaux ,
I just realized I hadn’t signed the CLA earlier — that’s done now ✅
Would you mind taking another look at this PR when you have a moment?
Thanks again for your time!


Development

Successfully merging this pull request may close these issues.

Distributed Training Fails at End: FileNotFoundError and State Dict Mismatch Issues
